Message phishing detection using machine learning characterization

ABSTRACT

An email phishing detection mechanism is provided that utilizes machine learning algorithms. The machine learning algorithms are trained on phishing and non-phishing features extracted from a variety of data sets. Embodiments extract embedded URL-based and email body text-based feature sets for training and testing the machine learning algorithms. Embodiments determine the presence of a phishing message through a combination of examining an embedded URL and the body text of the message for the learned feature sets.

BACKGROUND

Field

This disclosure relates generally to computer system security, and more specifically, to a machine learning mechanism for characterization of phishing communications.

Related Art

Users interact, on a daily basis, with physical, system, data, and services resources of all kinds, as well as each other. Each of these interactions, whether accidental or intended, poses some degree of security risk, depending on the behavior of the user. External factors can influence user behavior to open systems and services to malicious attack.

Phishing attacks, for example, are socially engineered attacks in which the attacker lures a victim to share sensitive information such as username, password, bank account number, address, Social Security number, and the like. Such attacks are carried out via emails, SMS messaging, and chat messages. Email is typically the most common medium for executing phishing attacks. Attackers can urge their victims to click a link, fill in a form, or reply with sensitive information. When the victim clicks the link, the link takes them to a fraudulent site or installs malware on the user's browser or system designed to extract sensitive information. The phishing attacks can then lead to identity theft, loss of intellectual property or finances, or theft of secret information. According to one report, in a recent year, 32% of data breaches and 78% of cyber espionage, including installation and use of backdoors, involved phishing. The financial sector is the most frequent target of phishing attacks. In light of these issues, it is desirable to have an efficient mechanism to identify and quarantine phishing-related attacks on an enterprise network.

SUMMARY OF THE INVENTION

An email phishing detection mechanism is provided that utilizes machine learning algorithms. The machine learning algorithms are trained on phishing and non-phishing features extracted from a variety of data sets. Embodiments extract embedded URL-based and email body text-based feature sets for training and testing the machine learning algorithms. Embodiments determine the presence of a phishing message through a combination of examining an embedded URL and the body text of the message for the learned feature sets.

In one embodiment, an information handling system configured as an electronic mail server for an enterprise network is provided. The information handling system includes a processor, a network interface coupled to the processor and communicatively coupled to the enterprise network, and a first memory storing instructions executable by the processor. The instructions are configured to extract a uniform resource locator (URL) address embedded in an electronic mail (email) message received by the network interface, determine whether the extracted URL includes one or more features associated with a phishing URL, extract body text from the email message, determine whether the extracted body text includes one or more features associated with a phishing email message, and classify the email message as one of phishing or not phishing using the determinations associated with the extracted URL and the extracted body text.

In one aspect of the above embodiment, the one or more features associated with a phishing URL are pre-determined by training a machine-learning URL classifier on one or more data sets including known phishing URLs and known non-phishing URLs. In a further aspect, the one or more features associated with the phishing URL include one or more of: use of shortening services on the URL; an IP address; a URL of greater than 75 characters; an “@” symbol within the URL; multiple sets of double slashes within the URL; a prefix or suffix separated by a hyphen in a domain of the URL; one or more subdomains; and an “https” within the domain of the URL.

In another aspect of the above embodiment, the one or more features associated with the phishing email message are pre-determined by training a machine-learning body text classifier on one or more data sets including known phishing emails and known non-phishing emails. In a further aspect, the one or more features associated with the phishing email message include one or more of: a general greeting; a lack of richness in vocabulary; a lack of similarity between a subject of the email and the extracted body text; excessive use of pronoun references; and intent of the body text that is indicative of notification, urgency, action, consequence, and threat.

In yet another aspect of the above embodiment, the processor is configured to determine whether the extracted body text includes one or more features associated with the phishing email message by being further configured to perform natural language processing of the extracted body text to generate dependency and part-of-speech tags associated with sentences within the extracted body text, and use the tags to determine whether the extracted body text includes language constructs associated with an intent indicative of notification, urgency, action, consequence, and threat. In a further aspect, the processor is configured to determine whether the extracted body text includes the language constructs by being further configured to match the tags with language construct information stored in a second memory coupled to the processor in one or more dictionaries where the language construct information stored in the dictionaries is pre-determined during training of a machine-learning body text classifier on one or more data sets including known phishing emails and known non-phishing emails.

In another aspect of the above embodiment, the processor is configured to classify the email message as one of phishing or not phishing by being further configured to: classify the email message as a phishing message when both the extracted URL phishing determination and the extracted body text determination indicate the message is phishing; classify the email message as a phishing message when one of the extracted URL phishing determination or the extracted body text determination has a probability of being phishing above a set threshold; and, classify the email message as not a phishing message when neither of the extracted URL phishing determination nor the extracted body text determination has a probability of being phishing above the set threshold.

In still another aspect of the above embodiment, the instructions executable by the processor are further configured to store the email message in a third memory and transmit a quarantine message to a recipient of the email message where the quarantine message includes a notification that the email message has been quarantined and information associated with the email message. In yet another aspect of the above embodiment, the determining of whether the extracted URL includes one or more features associated with a phishing URL and the determining of whether the extracted body text includes one or more features associated with a phishing email message are both performed by an associated machine-learning classifier.

Another embodiment provides a method for identifying phishing email messages. The method includes: receiving, at a network interface coupled to an enterprise network, an email message; extracting a URL address embedded in the email message; determining whether the extracted URL includes one or more features associated with a phishing URL; extracting body text from the email message; determining whether the extracted body text includes one or more features associated with a phishing email message; and, classifying the email message as one of phishing or not phishing using the determinations associated with the extracted URL and the extracted body text.

In one aspect of the above embodiment, the one or more features associated with a phishing URL are pre-determined by training a machine-learning URL classifier on one or more data sets including known phishing URLs and known non-phishing URLs. In a further aspect, the one or more features associated with a phishing URL include one or more of: use of shortening services on the URL; an IP address; a URL of greater than 75 characters; an “@” symbol within the URL; multiple sets of double slashes within the URL; a prefix or suffix separated by a hyphen in a domain of the URL; one or more subdomains; and an “https” within the domain of the URL.

In another aspect of the above embodiment, the one or more features associated with a phishing email message are pre-determined by training a machine-learning body text classifier on one or more data sets including known phishing emails and known non-phishing emails. In a further aspect, the one or more features associated with the phishing email message include one or more of: a general greeting; a lack of richness in vocabulary; a lack of similarity between a subject of the email and the extracted body text; excessive use of pronoun references; and intent of the body text that is indicative of notification, urgency, action, consequence, and threat.

In another aspect of the above embodiment, said determining whether the extracted body text includes one or more features associated with the phishing email message further includes performing natural language processing of the extracted body text to generate dependency and part-of-speech tags associated with sentences within the extracted body text, and determining whether the extracted body text includes language constructs associated with an intent indicative of notification, urgency, action, consequence, and threat, using the tags associated with the sentences. In a further aspect, said determining whether the extracted body text includes the language constructs further includes matching the tags with language construct information stored in a second memory coupled to the processor in one or more dictionaries where the language construct information stored in the dictionaries is pre-determined during training of a machine-learning body text classifier on one or more data sets including known phishing emails and known non-phishing emails.

In yet another aspect of the above embodiment, said classifying the email message as one of phishing or not phishing further includes: classifying the email message as a phishing message when both the extracted URL phishing determination and the extracted body text determination indicate the message is phishing; classifying the email message as a phishing message when one of the extracted URL phishing determination or the extracted body text determination has a probability of being phishing above a set threshold; and, classifying the email message as not a phishing message when neither of the extracted URL phishing determination nor the extracted body text determination has a probability of being phishing above the set threshold. In yet another aspect, the method further includes storing the email message in a third memory and transmitting a quarantine message to a recipient of the email message using the network interface where the quarantine message includes a notification that the email message has been quarantined and information associated with the email message.

Another embodiment provides an information handling system configured to examine communications incoming to an enterprise network for phishing communications. The information handling system includes: a processor; a network interface, coupled to the processor, and communicatively coupled to the enterprise network; and a first memory storing instructions executable by the processor. The instructions are configured to extract a URL address embedded in an incoming communication message received by the network interface, determine whether the extracted URL includes one or more features associated with a phishing URL, extract body text from the incoming communication message, determine whether the extracted body text includes one or more features associated with a phishing communication, and classify the incoming communication message as one of phishing or not phishing using the determinations associated with the extracted URL and the extracted body text.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments of the present invention may be better understood by referencing the accompanying drawings.

FIG. 1 is a simplified block diagram illustrating a network environment incorporating an email server configurable to implement phishing characterization of emails in accordance with embodiments of the present invention.

FIG. 2 is a generalized illustration of an information handling system that can be used to implement the system and method of the present invention.

FIG. 3 is a simplified functional block diagram illustrating an example electronic mail system information handling system configured in accordance with embodiments of the present invention.

FIG. 4 is a simplified functional block diagram illustrating an example of a phishing filter module configured in accordance with embodiments of the present invention.

FIG. 5 is a simplified flow diagram illustrating steps performed for topic discovery using Latent Dirichlet Allocation.

FIG. 6 is a diagram illustrating concepts associated with natural language processing, utilized by embodiments of the present invention.

FIG. 7 is a simplified flow diagram illustrating an example of a method 700 for performing phishing classification of an incoming email, in accordance with an embodiment of the present invention.

The use of the same reference symbols in different drawings indicates identical items unless otherwise noted. The figures are not necessarily drawn to scale.

DETAILED DESCRIPTION

Embodiments of the present invention provide an email phishing detection mechanism that utilizes machine learning algorithms. The machine learning algorithms are trained on phishing and non-phishing features extracted from a variety of data sets. Embodiments extract embedded URL-based and email body text-based feature sets for training and testing the machine learning algorithms.

Phishing emails are unwanted emails engineered to trick a user into opening and acting on them. One study has shown that 1 in 10 people open an email attachment when they have no idea what the attachment relates to, and about half of phishing victims open an attachment or click a URL link less than an hour after the phishing campaign launches or a phishing email reaches an inbox.

Phishing emails are dynamic, changing every quarter or every season (e.g., tax season and holiday seasons). A machine learning approach can be more flexible in responding to this changing nature, continuing to catch and block phishing emails as their character changes. Traditionally, static, rule-based email detection has been used, which makes no differentiation between spam and phishing. Treating spam and phishing the same in this way can miss an average of 1-2% of phishing emails.

During the development of embodiments of the present invention, a common pattern in phishing emails was discovered. In this pattern, a phishing attacker informs the user that some activity has happened (e.g., a quota limit exceeded, unusual activity detected on an account, and the like), followed by an action that the user has to take or perform (e.g., revalidate by clicking on a link, send back sensitive information, and the like), and followed by a consequence or threat that the user should avoid (e.g., your account will be deactivated or closed, or you won't be able to send or receive new emails, and the like).

These phishing emails are further crafted in a manner that pushes a user to click on a URL link or do as directed in the email. Additionally, the emails can have a link to a malicious site. These malicious sites may be only a few hours old, or the address may redirect to a different domain, or there may be an “https” buried in the domain name.

Embodiments of the present invention provide a mechanism to identify the combination of characteristics of a phishing email: (1) body text of the email (e.g., inform, action, consequence, and threat), and (2) URL (domain age, domain registration, IP number within the URL, and the like). This identification mechanism is a machine learning algorithm that provides a powerful email phishing detection tool.

FIG. 1 is a simplified block diagram illustrating a network environment 100 incorporating an email server configurable to implement phishing characterization of emails in accordance with embodiments of the present invention. A wide area network 105 is a source of incoming communications to an enterprise network 110. The incoming communications can include, for example, electronic mail, website communications, messaging, file data, and application data. A firewall 115 separates wide area network 105 from enterprise network 110, providing communications filtering such as antivirus filtering 120 and anti-intrusion filtering 125. Firewall 115 can be implemented by a variety of network nodes, including, for example, a router or a bridge configured to implement filtering and router protocols.

Within enterprise network 110 there can be a variety of network-connected servers, providing data services to users both internal and external to the enterprise network. Examples of network-connected servers include file servers 130, email server 135, instant messaging server 140, web server 145, and application servers 150. Each of these network-connected servers is configured to provide clients access to one or more applications and associated data via one or both of an internal network 155 of the enterprise or wide area network 105. Internal network 155 couples the various network-connected servers to clients 160.

The network-connected servers and clients each can be provided by one or more information handling systems coupled to the networks. In addition to the information handling systems configured as network nodes within the enterprise network, the environment can include cloud-based systems that are nodes on networks external to the enterprise network but serving data for which access is desired to be controlled.

FIG. 2 is a generalized illustration of an information handling system 200 that can be used to implement the systems and methods associated with the present invention. Information handling system 200 includes a processor (e.g., central processor unit or “CPU”) 202, input/output (I/O) devices 204, such as a display, a keyboard, a gesture input device, and associated controllers, a storage system 206, and various other subsystems 208. In various embodiments, the information handling system 200 also includes network port 210 operable to connect to a network 240, which is likewise accessible by a service provider server 242. The information handling system 200 likewise includes system memory 212, which is interconnected to the foregoing via one or more buses 214. System memory 212 further includes operating system (OS) 216 and in various embodiments may also include an electronic mail system 218. In one embodiment, the information handling system 200 is able to download the electronic mail system 218 from the service provider server 242. In another embodiment, electronic mail system 218 is provided as a service from the service provider server 242.

For the purposes of this disclosure, an information handling system may include any instrumentality or aggregate of instrumentalities operable to compute, classify, process, transmit, receive, retrieve, originate, switch, store, display, manifest, detect, record, reproduce, handle, or utilize any form of information, intelligence, or data for business, scientific, control, entertainment, or other purposes. For example, an information handling system may be a personal computer, a mobile device such as a tablet or smartphone, a consumer electronic device, a connected “smart device,” a network appliance, a network storage device, a network gateway device, a server or collection of servers or any other suitable device and may vary in size, shape, performance, functionality, and price. The information handling system may include volatile and/or non-volatile memory, and one or more processing resources such as a central processing unit (CPU) or hardware or software control logic. Additional components of the information handling system may include one or more storage systems, one or more wired or wireless interfaces for communicating with other networked devices, external devices, and various input and output (I/O) devices, such as a keyboard, a gesture input device (e.g., mouse, trackball, trackpad, touchscreen, and touch sensitive display device), a microphone, speakers, a track pad, a touchscreen and a display device (including a touch sensitive display device). The information handling system may also include one or more buses operable to transmit communication between the various hardware components.

For the purposes of this disclosure, computer-readable media may include any instrumentality or aggregation of instrumentalities that may retain data and/or instructions for a period of time. Computer-readable media may include, without limitation, storage media such as a direct access storage device (e.g., a hard disk drive or solid state drive), a sequential access storage device (e.g., a tape disk drive), optical storage device, random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), and/or flash memory; as well as communications media such as wires, optical fibers, microwaves, radio waves, and other electromagnetic and/or optical carriers; and/or any combination of the foregoing.

In various embodiments, electronic mail system 218 performs operations associated with email reception, filtering/analysis, and delivery. As will be appreciated, once information handling system 200 is configured to perform the electronic mail operations, the information handling system becomes a specialized computing device specifically configured to perform electronic mail operations and is not a general-purpose computing device. Moreover, the implementation of electronic mail system 218 on the information handling system 200 improves the functionality of the information handling system 200 and provides a useful and concrete result of performing electronic mail functions and functions associated with phishing detection module 220 to mitigate security risk. In embodiments of the present invention, electronic mail system 218 is implemented to include a phishing detection module 220. In certain embodiments, phishing detection module 220 may be implemented to perform the various phishing detection operations described in greater detail herein. It should be noted that embodiments are not limited to detecting phishing communications within an electronic mail message but can also detect phishing in other incoming communication vectors to an enterprise network, such as, for example, instant messaging, texts, and the like. The phishing detection modules discussed herein can be included as a part of a server managing these types of incoming communication vectors.

FIG. 3 is a simplified functional block diagram illustrating an example electronic mail system 218 information handling system configured in accordance with embodiments of the present invention. Inbound email 305 is received by electronic mail system 218 at receiver module 310. Upon reception of an incoming email, an address parser 315 can examine the intended recipient address to determine whether such an address is served by the electronic mail system. If no such address is served by the electronic mail system, then return message module 320 can generate a return message for the sender of the original incoming email and provide that to transmitter module 325 which can, in turn, provide return email 330 to the network. If the incoming email passes the address parser testing, the message can be provided to a content rules module 335 that can determine whether the email and attachments thereto conform to basic rules of the enterprise network for delivery (e.g., size, data protocol, and the like). As with the address parser, if the incoming electronic mail does not conform to the content rules, a return message can be generated by return message module 320.

If the incoming electronic mail passes tests of both the address parser and content rules, the electronic mail can be provided to a quarantine rules module 340. Quarantine rules module 340 performs a variety of content analyses of the incoming electronic mail to determine whether the content is potentially harmful (e.g., phishing) or spam type messaging, for example. A spam filtering module 345 and a phishing characterization module 350 can be provided within quarantine rules module 340 to perform the characterization and filtering analyses. Should an electronic mail message not clear the spam or phishing filters, the message can be stored by a quarantine module 355 and a message, informing the recipient that a quarantined electronic mail has been received, can be generated by a message generator 360. If the electronic mail message clears the various quarantine filters, then the email can be provided to the addressee via transmit module 365.

FIG. 4 is a simplified functional block diagram illustrating an example of a phishing filter module 350 configured in accordance with embodiments of the present invention. As discussed above, it was discovered that a common pattern in phishing emails exists in both the body text of the phishing email and in embedded URL links provided in the email toward which a user is encouraged to click. Phishing filter 350 is configured to perform analysis of both the URL and the body text. Specifically, a classification is performed using a trained machine learning classifier for each of the URL and body text.

An incoming email 405 is first provided to a module that performs extraction of an embedded URL 410, should there be such a URL. The URL extraction module examines the email for a clickable link and provides that URL to a URL machine learning classifier 420. As will be discussed in greater detail below, URL machine learning classifier 420 examines the extracted URL for characteristics associated with the types of links commonly found in phishing emails. These characteristics are trained into the machine learning classifier using data sets of known phishing emails and URLs known to be malicious. The URL-based features used to classify a link in an email as malicious typically fall into three categories: address bar, domain, and website. The combination of these features makes it possible to distinguish between phishing and legitimate sites.
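As an illustration of the address-bar features enumerated in the summary (shortening service, IP address, length greater than 75 characters, “@” symbol, repeated double slashes, hyphenated domain, subdomains, and “https” inside the domain), the following is a minimal sketch of a feature extractor. The function name url_features, the SHORTENERS list, and the dot-counting subdomain heuristic are assumptions made for this example and are not taken from the disclosure.

```python
import re
from urllib.parse import urlparse

# Hypothetical list of shortening services; a deployment would maintain its own.
SHORTENERS = {"bit.ly", "tinyurl.com", "t.co", "goo.gl"}


def url_features(url: str) -> dict:
    """Return address-bar features of a URL as a dictionary (illustrative only)."""
    host = urlparse(url).hostname or ""
    is_ip = re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host) is not None
    return {
        "uses_shortener": host in SHORTENERS,
        "has_ip_address": is_ip,
        "long_url": len(url) > 75,
        "has_at_symbol": "@" in url,
        # The scheme contributes one "//"; any further occurrence is suspicious.
        "extra_double_slash": url.count("//") > 1,
        "hyphen_in_domain": "-" in host,
        # Naive assumption: every dot beyond the first marks a subdomain.
        "subdomain_count": 0 if is_ip else max(host.count(".") - 1, 0),
        "https_in_domain": "https" in host,
    }
```

A dictionary of this kind could be converted to a numeric vector and passed to the URL classifier; domain-based and website-based features (such as domain age) would need external lookups and are omitted here.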

In addition to URL extraction, the email is subjected to a body text extraction by a body text extraction module 430. The extracted body text is provided to a trained body text machine learning classifier 440. Again, as will be discussed in greater detail below, body text machine learning classifier 440 examines the extracted body text for characteristics associated with types of text commonly found within phishing emails. The body text of a phishing email typically contains a unique and repetitive pattern pushing or urging a victim to click on a link or send personal information.
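Some of the body text features named in the summary (a general greeting, richness of vocabulary, similarity between subject and body, and use of pronouns) can be approximated with simple counts. The sketch below is one possible reading: it assumes vocabulary richness is measured as a type-token ratio and subject similarity as Jaccard overlap, neither of which is specified in the disclosure. The greeting and pronoun lists are hypothetical.

```python
import re

# Hypothetical word lists chosen for illustration.
GENERIC_GREETINGS = ("dear customer", "dear user", "dear member")
PRONOUNS = {"i", "you", "your", "we", "our", "it", "they", "them", "this", "that"}


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())


def body_features(subject: str, body: str) -> dict:
    """Compute simple lexical features of an email body (illustrative only)."""
    words = tokenize(body)
    body_set = set(words)
    subject_set = set(tokenize(subject))
    union = subject_set | body_set
    n = len(words)
    return {
        "generic_greeting": any(g in body.lower() for g in GENERIC_GREETINGS),
        # Assumed metric: distinct words divided by total words.
        "vocab_richness": len(body_set) / n if n else 0.0,
        # Assumed metric: Jaccard overlap between subject and body vocabularies.
        "subject_similarity": len(subject_set & body_set) / len(union) if union else 0.0,
        "pronoun_ratio": sum(w in PRONOUNS for w in words) / n if n else 0.0,
    }
```

The intent features (notification, urgency, action, consequence, and threat) depend on dependency and part-of-speech tagging and are not covered by this sketch.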

Classification of the embedded URL link and of the body text of the email is performed to provide a quantitative value for the probability of each being related to phishing. The probability values can then be provided to a phishing characterization module 450 that can combine the classifications using multiple if-else statements to determine whether that combination is indicative of a phishing email or meets or exceeds a threshold suggesting a phishing email. Should the determination be that the email is phishing related, then the phishing email can be quarantined by quarantine module 460. Quarantine module 460 can save the offending email in a quarantine database 470 and then transmit an informational message to the recipient indicating that the offending email has been quarantined and providing sufficient information for the recipient to identify the email. Should the email not be determined to be phishing, then the email can be provided to the recipient via a transmission module 470.
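The if-else combination performed by the phishing characterization module can be sketched as follows. The function name classify_email and the numeric values are assumptions for illustration: the disclosure refers only to "a set threshold", so the 0.5 decision point (at which an individual classifier is taken to indicate phishing) and the 0.8 threshold are placeholders. A url_prob of None represents an email with no embedded URL.

```python
def classify_email(url_prob, body_prob, decision=0.5, high_threshold=0.8):
    """Combine URL and body text phishing probabilities into one label.

    decision and high_threshold are hypothetical values, not from the source.
    """
    url_says_phish = url_prob is not None and url_prob >= decision
    body_says_phish = body_prob >= decision
    if url_says_phish and body_says_phish:
        # Both classifiers indicate phishing.
        return "phishing"
    elif (url_prob is not None and url_prob > high_threshold) or body_prob > high_threshold:
        # A single classifier is confident enough on its own.
        return "phishing"
    else:
        # Neither probability is above the set threshold.
        return "not phishing"
```

For example, classify_email(0.9, 0.1) returns "phishing" because the URL probability alone exceeds the assumed threshold, whereas classify_email(0.6, 0.2) returns "not phishing".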

To train a machine learning classifier, features were extracted from URLs and email bodies. These features were identified from various literature and research on the email body itself. As discussed above, two types of feature sets, URL and email body, were used to detect phishing emails with or without a malicious link. The two types of feature sets or combinations thereof are effective in detecting phishing and legitimate emails.

Embodiments utilize a Random Forest supervised learning model to perform email classification. A Random Forest is a group of multiple decision trees, each trained on a random subset of the training data. For classification, the Random Forest model votes on the output of all the decision trees to make a final prediction. During evaluation, Random Forest had better accuracy than other classifiers, such as decision trees and logistic regression.
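The training-on-random-subsets and voting behavior described above can be shown with a deliberately small, pure-Python sketch. It is a simplification, not the classifier used in the embodiments: each "tree" here is a one-split stump on a randomly chosen feature, and the names train_stump, train_forest, and predict are hypothetical. In practice a library implementation, such as scikit-learn's RandomForestClassifier, would more likely be used.

```python
import random
from collections import Counter


def train_stump(rows, labels, rng):
    """Fit a one-split tree on a random feature, splitting at the feature mean."""
    feature = rng.randrange(len(rows[0]))
    threshold = sum(r[feature] for r in rows) / len(rows)
    left = [y for r, y in zip(rows, labels) if r[feature] <= threshold]
    right = [y for r, y in zip(rows, labels) if r[feature] > threshold]
    default = Counter(labels).most_common(1)[0][0]
    left_label = Counter(left).most_common(1)[0][0] if left else default
    right_label = Counter(right).most_common(1)[0][0] if right else default
    return feature, threshold, left_label, right_label


def train_forest(rows, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample (random subset with replacement)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        sample_rows = [rows[i] for i in idx]
        sample_labels = [labels[i] for i in idx]
        forest.append(train_stump(sample_rows, sample_labels, rng))
    return forest


def predict(forest, row):
    """Return the majority vote of all trees for one feature vector."""
    votes = Counter(
        left if row[feature] <= threshold else right
        for feature, threshold, left, right in forest
    )
    return votes.most_common(1)[0][0]
```

Here rows would be numeric feature vectors (for instance, URL or body text features encoded as 0/1 values) and labels would be 1 for phishing and 0 for legitimate.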

Training the Random Forest classifier requires labeled data for both phishing and legitimate URLs and email bodies. The URL classifier was trained using a large set of known phishing URLs and a correspondingly large set of known legitimate URLs. The body text classifier was similarly trained using a large set of known phishing emails and a large set of known legitimate emails.

In order to accurately train the Random Forest, the datasets were cleaned to ensure that only accurate phishing-related information was present. For example, one original body text data set was a collection of phishing and spam emails. To build a robust phishing classifier, the classifier needed to be trained only on a phishing data set that did not include spam emails.

Spam and phishing emails differ in purpose, approach, and text content. To remove spam from the body text data set, a Latent Dirichlet Allocation (LDA) topic modelling technique was employed. LDA describes a document as a mixture of a small number of topics and attributes each word's presence to one of the document's topics. LDA can thereby discover the topics that best describe a set of documents.

FIG. 5 is a simplified flow diagram illustrating steps performed for topic discovery using LDA. Initially, data pre-processing (510) is performed. Such pre-processing includes, for example, removing stopwords and performing lemmatization. Stopwords include words such as, for example, “is,” “the,” “am,” and “an.” Lemmatization involves replacing a word with its root word (e.g., “connection” replaced with “connect”). Once data pre-processing is performed, a tokenization process (520) is performed, which converts sentences in the body text to individual words. Phrase modeling (530) is also performed. Bigrams were used for one embodiment of phrase modeling. An N-gram is a sequence of N words. Thus, a bigram is a two-word sequence such as “please turn,” “turn your,” “New York,” “suspicious activity,” and “your account.”
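A minimal sketch of the pre-processing, tokenization, and bigram steps is given below. The stopword set and the lemma lookup table are tiny hypothetical stand-ins for the resources a real NLP toolkit would supply, and the sketch folds tokenization into the pre-processing function because stopword removal needs individual words. It also enumerates every adjacent word pair, whereas a phrase model would typically retain only frequently co-occurring pairs.

```python
import re

# Small illustrative resources; real systems would use a full NLP toolkit.
STOPWORDS = {"is", "the", "am", "an", "a", "to", "of"}
LEMMAS = {"connection": "connect", "accounts": "account"}  # hypothetical lookup


def preprocess(text: str) -> list:
    """Tokenize, drop stopwords, and replace words with their root form."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [LEMMAS.get(t, t) for t in tokens]


def bigrams(tokens: list) -> list:
    """Return every two-word sequence of adjacent tokens."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
```

Applied to the tokens "please", "turn", "your", the bigrams function yields "please turn" and "turn your", matching the examples in the text.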

Topic modeling (540) is then performed. During topic modeling, a unique identifier is assigned to each word to form a document vocabulary. Then each document in the dataset is converted into a corpus format that reflects word identifiers and the frequency of each word in the document. The LDA model then receives the document vocabulary and corpus to generate topics, as discussed above. In one embodiment, the number of topics is specified in advance and a list of topics is generated. A representative document is determined (550) for each topic. This can aid in determining the nature of the topic. In the instance of the body text dataset, this allows for differentiation between phishing and spam emails. Labeling (560) is then performed to indicate the most dominant topic associated with each document in the dataset, thereby assigning a topic to every document. Documents are then sorted according to their labels to retain the documents with topics associated with phishing emails for the training dataset (570).
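The vocabulary and corpus format described above can be sketched in a few lines. The helper names build_vocabulary and to_corpus are hypothetical; the sketch only shows the data layout (word identifiers paired with per-document counts) that a topic model would consume, not the LDA fitting itself.

```python
from collections import Counter


def build_vocabulary(docs: list) -> dict:
    """Assign a unique integer identifier to each distinct word."""
    vocab = {}
    for doc in docs:
        for word in doc:
            vocab.setdefault(word, len(vocab))
    return vocab


def to_corpus(docs: list, vocab: dict) -> list:
    """Convert each tokenized document to sorted (word_id, count) pairs."""
    return [sorted(Counter(vocab[w] for w in doc).items()) for doc in docs]
```

For two tokenized documents, ["fund", "bank", "fund"] and ["bank", "lottery"], the vocabulary maps fund to 0, bank to 1, and lottery to 2, and the first document becomes the pairs (0, 2) and (1, 1).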

An example of topics, with the top 10 words associated with each topic and a label (e.g., spam/phish), is provided in Table 1 below.

TABLE 1

Topic | Top 10 words | Label
0 | face, company, account, key, meet, include, email, logistic, day, business | Phish
1 | payment, united, compensation, nation, contact, fund, delivery, address, email, card | Phish
2 | card, atm, delivery, contact, address, payment, bank, company, pay, visa | Phish
3 | bank, fund, account, payment, transfer, receive, state, federal, address, united | Phish
4 | money, fund, business, bank, late, know, help, good, contact, want | Phish
5 | good, regard, life, change, happen, world, believe, think, time, positive | Spam
6 | email, claim, lottery, win, number, winner, address, prize, reply, draw | Phish
7 | fund, email, receive, contact, information, money, send, address, mr, number | Phish
8 | fund, payment, Nigeria, address, receive, office, money, western union, information, know | Phish
9 | fund, usd, email, help, money, know, information, charity, mr, world | Phish
10 | consignment, box, address, united, state, airport, email, international, delivery, detail | Phish
11 | de, important, com, org, nz, net, info, uk, co, font | Spam
12 | Account, email, mail, click, service, message, information, mailbox, update, customer | Phish

The LDA algorithm was trained on a data set of approximately 22,000 phishing and spam emails. The trained LDA model was used to predict the most likely topic for each email in the data set. When a topic was subsequently manually identified as spam, emails associated with that topic were discarded from the data set.

Once the data sets for training the body text classifier and the URL classifier had been subjected to data cleaning, features were extracted from the phishing email body texts and the URLs. Additional pre-processing was performed on the email body text data set to check the emails for data content and language, for example. With regard to data content, the body text of an email was required to contain more than five words for useful feature extraction. In addition, the training dataset was restricted to one language. In one embodiment, the training dataset was restricted to English. Any non-English language emails were rejected from the dataset for feature extraction.

The URL and body text classifiers were trained on 85% of the feature data set, and the remaining 15% of the data set was used to compute the classifiers' accuracy. Decision tree, logistic regression, and Random Forest classifiers were compared before determining that a Random Forest classifier performed better than both alternative classifiers with respect to accuracy (e.g., (True Positives+True Negatives)/N), precision (e.g., True Positives/(True Positives+False Positives)), and recall (e.g., True Positives/(True Positives+False Negatives)) scores.
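The three scores named above can be computed directly from true and predicted labels. The sketch below is illustrative only: the label lists are made-up values, not results from the tested embodiments, and the function name is an assumption.

```python
def classification_scores(y_true, y_pred):
    """Compute accuracy, precision, and recall for binary labels (1 = phishing)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Made-up labels for eight held-out emails, used only to exercise the formulas.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
accuracy, precision, recall = classification_scores(y_true, y_pred)
```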

As discussed above with regard to FIG. 4, the classifications generated by URL machine learning classifier 420 and body text machine learning classifier 440 are combined during phishing characterization 450. In one embodiment, the URL and email Random Forest classifiers are merged using multiple if-else statements. For example, if both classifiers predict a test case as positive, then the prediction is that the instant email is phishing (a true positive). If the predictions from the classifiers do not match, then the characterization is based on the probabilities from the classifiers. If the probability of being phishing is higher for either classifier than a set threshold (e.g., probability greater than 60%), then the instant email is labeled as phishing; otherwise, it is labeled as legitimate.
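One possible reading of this if-else merge is sketched below. The function name and the 0.5 decision boundary used to turn each classifier's probability into a positive or negative prediction are assumptions; only the 60% threshold comes from the example above.

```python
def combine_predictions(url_prob, text_prob, threshold=0.6, decision=0.5):
    """Merge URL and body text phishing probabilities into one label.

    `decision` (an assumed value) converts each probability into a
    positive or negative prediction; `threshold` resolves disagreements.
    """
    url_positive = url_prob >= decision
    text_positive = text_prob >= decision
    if url_positive and text_positive:
        return "phishing"
    if url_positive != text_positive:
        # The classifiers disagree: fall back on the probabilities.
        if max(url_prob, text_prob) > threshold:
            return "phishing"
        return "legitimate"
    return "legitimate"

label = combine_predictions(url_prob=0.7, text_prob=0.2)
```

In this example the classifiers disagree, and the URL probability of 0.7 exceeds the threshold, so the email is labeled as phishing.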

During body text training, a common pattern in the language of phishing emails was discovered. In the pattern of a typical phishing email, an attacker informs the user that some activity has happened, followed by an action and consequences that the user has to take or perform, followed by a threat that the user should avoid. An example phishing email contained the following text:

-   Your mailbox has exceeded the storage limit which was set by your administrator, you may not be able to send or receive new mail until you revalidate your mailbox. To revalidate your mailbox, please send the following details below:
-   [Request for name, username, password, email address, and phone number]
-   If you fail to revalidate your mailbox, your mailbox will be deactivated!!!

Deciphering the above phishing email with regard to the determined pattern of phishing emails reveals the following:

-   Inform: your mailbox has exceeded the storage limit which was set by your administrator
-   Action: to revalidate your mailbox please send the following details below
-   Threat: if you fail to revalidate your mailbox your mailbox will be deactivated!!!
-   Consequence: you may not be able to send or receive new email until you revalidate your mailbox.

In addition to the above, an intent sequence is also common in many phishing emails. That is, the attacker unveils their intent in a certain sequence, and the sequence is another feature that helps in distinguishing between phishing and legitimate emails. The observed sequences include, for example: inform, consequence, action, threat, consequence; inform, consequence, action, threat; and inform, consequence, action.

Further, typical phishing emails have an excessive use of pronoun terms such as “you,” “your,” “our,” and “we.” A typical phishing attacker uses words such as “you” and “your” to trigger a panic response and make it personal for the reader of the phishing email (e.g., “this is to inform you that suspicious activity is observed on your account.”).

In order to develop a robust email classifier, embodiments are configured to extract information associated with the intentions discussed above. With regard to the “inform” characteristic, text is identified with regard to who the attacker is informing and about what the attacker is informing. With regard to the “action” characteristic, text is identified regarding what action the attacker is asking a user to perform, who must perform the action, and what entity the action is intended for. With regard to the “consequence” and “threat” characteristics, text is identified regarding what the consequence and required action for the consequence is, who will be affected by the consequence, and what will be the consequence if the threat is ignored. Finally, with regard to the “urgency” characteristic, text is identified regarding whether there is an urgency or convincing tone, and if there is an urgency, what action is associated with that urgency.

FIG. 6 is a diagram illustrating concepts associated with natural language processing, utilized by embodiments of the present invention. To extract the above features, natural language processing (NLP) techniques, such as dependency parsing and part-of-speech (POS) tagging, are used. FIG. 6 illustrates a set of dependency tags 610 for a sentence 605 and a set of POS tags 615 for that sentence. Along with the dependency tags 610 are a set of connectors representing the syntactic dependencies or relationships between the words in sentence 605. Each connector has a head and a dependent word (e.g., for the portion of the sentence “send your information,” “send” is the head and “your information” is the dependent).

Dependency parsing describes the relationship between a head word and its dependencies, and the relationship is written in lowercase in FIG. 6. POS tagging assigns tags to the words within a sentence, like noun or verb, which are written in uppercase in FIG. 6. Using dependency parsing and POS tagging techniques, the most frequently used words to convey intentions were identified during training. These words facilitate the creation of multiple dictionaries which help in collecting features for training the Random Forest email body text classifier. Dependency parser keywords found to be most useful in the formation of the dictionaries included adverbial clause modifiers (advcl), adverbial modifiers (advmod), direct objects (dobj), and nominal subjects (nsubj). POS tags found to be most useful in formation of the dictionaries included, for example, verbs and nouns. In addition, the grammar structure of the text (e.g., left and right children of the keywords) was used to help in identifying whether an attacker was conveying information or asking a user to take an action.

As one example of creation of the Inform characteristic dictionary, nominal subject (nsubj) and verbs (VERB) dependency and POS tagging keywords are required. A dictionary was created associated with “informing about what?” This dictionary contains dependent words of the nominal subject dependency tag (e.g., mailbox, account, storage, and password). Another dictionary was created associated with informing the user about an event or activity that has happened and what that event is. This dictionary contains head and dependent word pairs of the nominal subject dependency tag and the dependent's left child and must meet the following conditions: the left child is a NOUN word where the NOUN word can only be you, your, our, or we, and the words are in the same sentence (e.g., “your mailbox has exceeded the storage limit.”).

As an example of creation of an Action characteristic dictionary, direct object (dobj), verbs (VERB), and nouns (NOUN) dependency and POS tagging keywords are required. A dictionary is created associated with “what action an attacker is asking a user to perform.” This dictionary contains head words of the direct object (dobj) syntactic dependency tag, where a head word is a verb (VERB) (e.g., click, send, update, upgrade, confirm, and the like). Further, an action related information dictionary is generated, such as on what entity, whose entity is that, and who must perform the action. This dictionary contains head and dependent word pairs of the direct object (dobj) dependency tag and a left child of the dependent, where the head word is a verb (VERB) and the left child of the dependent is a noun (NOUN). In addition, the noun can only be “you” or “your” with the words in the same sentence (e.g., “for your security we need you to validate this change to your account”).

As an example of creation of an Urgency characteristic dictionary, adverbial modifiers (advmod) and verbs (VERB) dependency and POS tagging keywords are used. One dictionary is generated associated with whether there is an urgency or convincing tone detected in the email. This dictionary is made of dependent words of the adverbial modifier dependency tag (e.g., immediately, kindly, temporarily, quickly, and the like). Another dictionary is generated associated with if there is an urgency, then what action is being asked to be performed. This dictionary includes head and dependent word pairs of the adverbial modifier dependency tag, where the head must be a verb (e.g., “immediately contact,” “kindly click,” and the like).
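A minimal sketch of how the two Urgency dictionaries might be assembled from parsed sentences follows. The `Token` structure and the hand-written parse are assumptions standing in for the output of a dependency parser and POS tagger; no particular NLP library is implied.

```python
from collections import namedtuple

# Hypothetical parsed-token record: `head` is the index of the head token
# within the same sentence. A real parser would supply these fields.
Token = namedtuple("Token", "text pos dep head")

def urgency_dictionaries(sentences):
    """Collect advmod dependents and (dependent, verb head) pairs."""
    tone_words = set()       # urgency or convincing-tone words
    urgent_actions = set()   # which action the urgency is attached to
    for sentence in sentences:
        for token in sentence:
            if token.dep != "advmod":
                continue
            tone_words.add(token.text)
            head = sentence[token.head]
            if head.pos == "VERB":
                urgent_actions.add((token.text, head.text))
    return tone_words, urgent_actions

# Hand-written parse of "kindly click the link immediately" (illustrative).
parsed = [[
    Token("kindly", "ADV", "advmod", 1),
    Token("click", "VERB", "ROOT", 1),
    Token("the", "DET", "det", 3),
    Token("link", "NOUN", "dobj", 1),
    Token("immediately", "ADV", "advmod", 1),
]]
tone_words, urgent_actions = urgency_dictionaries(parsed)
```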

As an example of creation of a Consequence and Threat characteristic dictionary, adverbial clause modifiers (advcl), verbs (VERB), and nouns (NOUN) dependency and POS tagging keywords are used. One dictionary is generated associated with what action the user has to perform to avoid the consequence. This dictionary contains dependent words of the adverbial clause modifier dependency tag (e.g., revalidate, confirm, validate, avoid, and the like). Another dictionary is generated associated with what actions or activity will be affected. This dictionary contains the head word associated with an adverbial clause, where the head is a verb and the adverbial clause's dependent word is a right child of the head word (e.g., “you may not be able to send or receive new mail until you revalidate”). Yet another dictionary is generated associated with what will happen if the user does not perform the action (i.e., the threat). This dictionary contains an adverbial clause modifier's head and dependent word pair, in which both the head and dependent are verbs and the adverbial clause modifier's dependent word is a left child of the head word (e.g., “if you fail to revalidate your mailbox, your mailbox will be deactivated”).

The dictionaries are used to collect email features to build the email classifier. These features reflect the percentage of intentions present in the email, rather than just binary information (e.g., whether intention-related words exist in the email or not). This process creates a significantly more robust email classifier.
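The difference between a binary feature and a percentage feature can be sketched as follows. The text does not specify how the percentage is computed, so this example assumes it is the share of words in the body that match each intent dictionary. The dictionary contents reuse example words from the description; the function name is hypothetical.

```python
import re

# Small single-word dictionaries built from the example words given above.
INTENT_DICTIONARIES = {
    "inform": {"mailbox", "account", "storage", "password"},
    "action": {"click", "send", "update", "upgrade", "confirm"},
    "urgency": {"immediately", "kindly", "temporarily", "quickly"},
    "consequence": {"revalidate", "confirm", "validate", "avoid"},
}

def intent_percentages(body):
    """Return, per intent, the percentage of body words found in its dictionary.

    Assumption: "percentage of intentions" means matching words / total words.
    """
    words = re.findall(r"[a-z]+", body.lower())
    total = len(words) or 1  # avoid division by zero on an empty body
    return {
        intent: 100.0 * sum(w in vocab for w in words) / total
        for intent, vocab in INTENT_DICTIONARIES.items()
    }

features = intent_percentages("Kindly confirm your account immediately")
```

A binary feature would record only that urgency words are present; the percentage form also records that two of the five words in this short message carry an urgency tone.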

As discussed above, phishing emails usually include a URL link to a malicious site. Examination of the training datasets for URLs suggests that malicious URLs include features falling into three categories: address bar, domain, and website. Examples of such URL features are provided in Table 2 below.

TABLE 2

CATEGORY | DESCRIPTION | EXAMPLE
Address Bar | Use of URL shortening services such as bit.ly and tiny.cc | hxxps://bit.ly/32eMY7h
Address Bar | IP address in URL | http://161.53.205.196/sites/default/files/bankofamerica/login.php
Address Bar | Long URLs, which hide the suspicious part of the URL (e.g., greater than 75 characters) | http://aceshiprecycling.com/wp-cgi/verify/chase.com/home/myaccount/vbv.php?websrc=e17ea285ad581008aac6b89b49ab879e&dispached=47&id=1613955414
Address Bar | Use of @ in URL | https://bishopgat.xyz/@%23%25$@%23$%25@@/
Address Bar | Location of double slashes | https://href.li/?https://www.danicathreesixty.com
Address Bar | Prefix or suffix separated by hyphen to domain | http://www.www-paypal.info/us/cgi-bin/webscrc=login%20run
Address Bar | Subdomain and multi-subdomain | http://paypal.co.uk.w97s.top/
Address Bar | https in a URL domain | http://https-security.000webhostapp.com
Domain | Domain registration length | Registration of less than a year is indicative of phishing
Domain | Domain age | Domain age of less than half a year is indicative of phishing
Domain | Alexa ranking | Check whether domain is present in top 1 million Alexa ranking else suspicious
Domain | Suspicious domain | Check whether domain is present in phishtank database
Domain | Missing DNS record |
Website | Excessive number of redirects; Server form handler either empty or refers to a different domain; Website sends information to be filled by the user to an email; Link and anchor tags pointing to different domain; Invalid SSL certificate |
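The address bar features in Table 2 lend themselves to simple string checks. The following sketch shows one way such checks could be written using Python's standard `urllib.parse` module. The function name, the feature keys, and the "more than two dots" rule for multi-subdomain detection are assumptions; domain and website features, which require external lookups, are omitted.

```python
import re
from urllib.parse import urlsplit

SHORTENERS = {"bit.ly", "tiny.cc"}  # services named in Table 2
IPV4 = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")

def address_bar_features(url):
    """Return boolean address bar features for one URL."""
    host = urlsplit(url).hostname or ""
    is_ip = IPV4.fullmatch(host) is not None
    return {
        "shortener": host in SHORTENERS,
        "ip_address": is_ip,
        "long_url": len(url) > 75,
        "at_symbol": "@" in url,
        # A second "//" after the scheme separator suggests a redirect target.
        "extra_double_slash": url.count("//") > 1,
        "hyphen_in_domain": "-" in host,
        # Assumed rule: more than two dots in a non-IP host.
        "multi_subdomain": not is_ip and host.count(".") > 2,
        "https_in_domain": "https" in host,
    }

features = address_bar_features("http://paypal.co.uk.w97s.top/")
```

For the Table 2 example above, only the `multi_subdomain` feature is set. A trained classifier would consume such features together with the domain and website features rather than rely on any single check.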

Embodiments are not limited to the URL classification above. URL classifiers can also include, for example, branding classification and credential validation to determine whether a URL is legitimate or phishing related.

Once the URL and body text machine learning classifiers are trained, they can be used in the process of determining whether incoming messages (e.g., inbound email 305) contain phishing-related content.

FIG. 7 is a simplified flow diagram illustrating an example of a method 700 for performing phishing classification of an incoming email, in accordance with an embodiment of the present invention. The method can be performed by, for example, phishing filter 350 in electronic mail system 218 executed by an information processing system. The email is received (705) at the phishing filter, for example, subsequent to having had address parsing (e.g., by address parser 315) and content checking (e.g., by content rules module 335).

An initial check is made to determine whether there is a URL embedded in the email text (710). If there is a URL present, then the URL is extracted from the email message (715) and subjected to URL classification (720), as discussed above. The URL is examined by a machine learning classifier to determine whether the URL features (e.g., Table 2) discovered during training of the classifier are present in the extracted URL. A URL-associated phishing classification is generated. The classification result can be either a legitimate URL (score=0), phishing URL (score=1), or a suspicious URL (score=−1), the quantity of which depends upon the number and types of URL features present in the extracted URL.

As discussed above, phishing classification also involves several analyses of the body text of the email. As part of this process, the body text is extracted from the email (725). A set of analyses are then performed. The greeting of the email is analyzed (730) to determine whether the greeting is general or specific for the recipient. For example, determinations can be made as to whether the recipient's name is missing and whether the greeting is generalized or to an email address rather than an individual's name (e.g., “Greetings,” “Dear,” “Dear Valued Customer,” “My Dear Beneficiary,” “Dear Son of God,” and “Hi xxx@xyz.com”). The subject of the email can also be analyzed (735) to determine whether there is any similarity between the email subject and the body content (e.g., shared words or phrases). Pronoun references can also be analyzed (740) by counting the number of such pronouns (e.g., you, your, our, and we). Excessive use of these pronoun references may be indicative of a phishing email. Each of these analyses can contribute to a percentage chance that the email is phishing related.
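The greeting, subject, and pronoun analyses can each be approximated with simple text checks, as in the sketch below. The function names, the word-overlap measure for subject similarity, and the rule that a greeting is generic when it lacks the recipient's name or contains an email address are all assumptions made for illustration.

```python
import re

PRONOUNS = {"you", "your", "our", "we"}

def words(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z']+", text.lower())

def greeting_is_generic(greeting, recipient_name):
    """Treat a greeting as generic if it uses an email address or omits
    the recipient's name (recipient_name must be non-empty)."""
    line = greeting.lower()
    return "@" in line or recipient_name.lower() not in line

def subject_overlap(subject, body):
    """Fraction of distinct subject words that also appear in the body."""
    subject_words = set(words(subject))
    if not subject_words:
        return 0.0
    return len(subject_words & set(words(body))) / len(subject_words)

def pronoun_ratio(body):
    """Fraction of body words that are pronoun references."""
    tokens = words(body)
    return sum(t in PRONOUNS for t in tokens) / len(tokens) if tokens else 0.0

body = ("Your mailbox has exceeded the storage limit, "
        "you must revalidate your mailbox")
generic = greeting_is_generic("Dear Valued Customer", "Alice")
overlap = subject_overlap("Mailbox storage limit", body)
ratio = pronoun_ratio(body)
```

These values would serve as inputs to the body text classifier, not as verdicts on their own.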

In preparation for the body text to be analyzed for the features discovered during training of the machine learning classifier, natural language processing (745) is performed. The natural language processing is similar to that discussed above with regard to initial natural language processing involved with the training data set. Dependency parsing and part-of-speech tagging are performed to identify features within the body text of the email. Using the processed body text, the intent of the email is determined (750). The intent of the email is assessed using the features discussed above with regard to notifying, urgency, action, consequence, and threat. The assessed intent of the body text contributes to a percentage chance that the email is phishing related.

Each of the body text classification factors (e.g., greeting, subject, references, and intent) is used by the machine learning process to generate a numerical determination of whether the email body text is associated with phishing (e.g., a probability of phishing=0.9, while non-phishing=0.1) (755). Subsequently, an overall phishing determination is made (760) that combines the body text classification with the URL classification. If the overall phishing determination indicates that the message is likely phishing related (e.g., probability greater than a predetermined threshold) (765), then the email content is quarantined (770) (e.g., by a quarantine module 355). A message can be generated for the intended recipient, indicating that a likely phishing-related message was received for the recipient. If the overall phishing determination indicates that the message is likely not phishing related (765), then the email can be transmitted to the recipient (775) or subjected to other quarantine-associated or filter-related tasks (e.g., a spam determination).

In one embodiment, the overall phishing determination (760) is performed by merging the URL classification and body text classification using multiple if-else statements, as with the training and testing process described above. Thus, if both classifiers predict the email as positive for phishing, then the prediction is phishing (a true positive). If the classifiers' predictions do not match, then the prediction is based upon the probabilities generated by the classifiers. If the probability of being phishing is higher for either classifier than a set threshold (e.g., probability >60%), then the email is labeled as phishing.

Embodiments incorporating the Random Forest machine learning trained classifiers provide highly accurate detection of phishing emails compared with traditional methods. Embodiments tested detected between 0.5% and 2.0% more phishing emails than other email detection products. Given the number of emails typically received by a large enterprise network in a day, this can amount to thousands of phishing emails currently being missed and endangering users and their data. Embodiments of the present invention can fill this gap, providing a benefit of more secure networks and data.

Because the apparatus implementing the present invention is, for the most part, composed of electronic components and circuits known to those skilled in the art, circuit details will not be explained in any greater extent than that considered necessary as illustrated above, for the understanding and appreciation of the underlying concepts of the present invention and in order not to obfuscate or distract from the teachings of the present invention.

The term “program,” as used herein, is defined as a sequence of instructions designed for execution on a computer system. A program, or computer program, may include a subroutine, a function, a procedure, an object method, an object implementation, an executable application, an applet, a servlet, a source code, an object code, a shared library/dynamic load library and/or other sequence of instructions designed for execution on a computer system.

Some of the above embodiments, as applicable, may be implemented using a variety of different information processing systems. For example, although FIGS. 2 and 3 and the discussion thereof describe an exemplary information processing architecture, this exemplary architecture is presented merely to provide a useful reference in discussing various aspects of the invention. Of course, the description of the architecture has been simplified for purposes of discussion, and it is just one of many different types of appropriate architectures that may be used in accordance with the invention. Those skilled in the art will recognize that the boundaries between logic blocks are merely illustrative and that alternative embodiments may merge logic blocks or circuit elements or impose an alternate decomposition of functionality upon various logic blocks or circuit elements.

Thus, it is to be understood that the architectures depicted herein are merely exemplary, and that in fact many other architectures can be implemented which achieve the same functionality. In an abstract, but still definite sense, any arrangement of components to achieve the same functionality is effectively “associated” such that the desired functionality is achieved. Hence, any two components herein combined to achieve a particular functionality can be seen as “associated with” each other such that the desired functionality is achieved, irrespective of architectures or intermedial components. Likewise, any two components so associated can also be viewed as being “operably connected,” or “operably coupled,” to each other to achieve the desired functionality.

Also for example, in one embodiment, the illustrated elements of system 200 are circuitry located on a single integrated circuit or within a same device. Alternatively, system 200 may include any number of separate integrated circuits or separate devices interconnected with each other. For example, memory 212 may be located on a same integrated circuit as CPU 202 or on a separate integrated circuit or located within another peripheral or slave discretely separate from other elements of system 200. Other subsystems 208 and I/O circuitry 204 may also be located on separate integrated circuits or devices.

Furthermore, those skilled in the art will recognize that boundaries between the functionality of the above-described operations are merely illustrative. The functionality of multiple operations may be combined into a single operation, and/or the functionality of a single operation may be distributed in additional operations. Moreover, alternative embodiments may include multiple instances of a particular operation, and the order of operations may be altered in various other embodiments.

All or some of the software described herein may be received by elements of system 200, for example, from computer readable media such as memory 212 or other media on other computer systems. Such computer readable media may be permanently, removably or remotely coupled to an information processing system such as system 200. The computer readable media may include, for example and without limitation, any number of the following: magnetic storage media including disk and tape storage media; optical storage media such as compact disk media (e.g., CD-ROM, CD-R, etc.) and digital video disk storage media; nonvolatile memory storage media including semiconductor-based memory units such as FLASH memory, EEPROM, EPROM, ROM; ferromagnetic digital memories; MRAM; volatile storage media including registers, buffers or caches, main memory, RAM, etc.; and data transmission media including computer networks, point-to-point telecommunication equipment, and carrier wave transmission media, just to name a few.

In one embodiment, system 200 is a computer system such as a personal computer system. Other embodiments may include different types of computer systems. Computer systems are information handling systems which can be designed to give independent computing power to one or more users. Computer systems may be found in many forms including but not limited to mainframes, minicomputers, servers, workstations, personal computers, notepads, personal digital assistants, electronic games, automotive and other embedded systems, cell phones and various other wireless devices. A typical computer system includes at least one processing unit, associated memory and a number of input/output (I/O) devices.

A computer system processes information according to a program and produces resultant output information via I/O devices. A program is a list of instructions such as a particular application program and/or an operating system. A computer program is typically stored internally on a computer readable storage medium or transmitted to the computer system via a computer readable transmission medium. A computer process typically includes an executing (running) program or portion of a program, current program values and state information, and the resources used by the operating system to manage the execution of the process. A parent process may spawn other, child processes to help perform the overall functionality of the parent process. Because the parent process specifically spawns the child processes to perform a portion of the overall functionality of the parent process, the functions performed by child processes (and grandchild processes, etc.) may sometimes be described as being performed by the parent process.

Although the invention is described herein with reference to specific embodiments, various modifications and changes can be made without departing from the scope of the present invention as set forth in the claims below. For example, while a Random Forest method is discussed with regard to the machine learning classifiers, other types of classifiers can be used without departing from the scope of the present invention. Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of the present invention. Any benefits, advantages, or solutions to problems that are described herein with regard to specific embodiments are not intended to be construed as a critical, required, or essential feature or element of any or all the claims.

Furthermore, the terms “a” or “an,” as used herein, are defined as one or more than one. Also, the use of introductory phrases such as “at least one” and “one or more” in the claims should not be construed to imply that the introduction of another claim element by the indefinite articles “a” or “an” limits any particular claim containing such introduced claim element to inventions containing only one such element, even when the same claim includes the introductory phrases “one or more” or “at least one” and indefinite articles such as “a” or “an.” The same holds true for the use of definite articles.

Unless stated otherwise, terms such as “first” and “second” are used to arbitrarily distinguish between the elements such terms describe. Thus, these terms are not necessarily intended to indicate temporal or other prioritization of such elements.

What is claimed is:
 1. An information handling system configured as anelectronic mail server for an enterprise network, and comprising: aprocessor; a network interface, coupled to the processor, andcommunicatively coupled to the enterprise network; a first memory,storing instructions executable by the processor and configured toextract a uniform resource locator (URL) address embedded in anelectronic mail (email) message received by the network interface,determine whether the extracted URL comprises one or more featuresassociated with a phishing URL, extract body text from the emailmessage, determine whether the extracted body text comprises one or morefeatures associated with a phishing email message, and classify theemail message as one of phishing or not phishing using thedeterminations associated with the extracted URL and the extracted bodytext.
 2. The information handling system of claim 1, wherein the one ormore features associated with a phishing URL are pre-determined bytraining a machine-learning URL classifier on one or more datasetscomprising known phishing URLs and known non-phishing URLs.
 3. Theinformation handling system of claim 2, wherein the one or more featuresassociated with a phishing URL comprise one or more of: use ofshortening services on the URL; an Internet Protocol (IP) address; a URLof greater than 75 characters; an “@” symbol within the URL; multiplesets of double slashes within the URL; a prefix or suffix separated by ahyphen to a domain of the URL; one or more subdomains; and an “https”within the domain of the URL.
 4. The information handling system ofclaim 1, wherein the one or more features associated with a phishingemail message are pre-determined by training a machine-learning bodytext classifier on one or more datasets comprising known phishing emailsand known non-phishing emails.
 5. The information handling system ofclaim 4, wherein the one or more features associated with a phishingemail message comprise one or more of: a general greeting; a lack ofrichness in vocabulary; a lack of similarity between a subject of theemail and the extracted body text; excessive use of pronoun references;and intent of the body text that is indicative of notification, urgency,action, consequence, and threat.
 6. The information handling system ofclaim 1 wherein the processor is configured to determine whether theextracted body text comprises one or more features associated with aphishing email message by being further configured to: perform naturallanguage processing of the extracted body text to generate dependencyand part-of-speech tags associated with sentences within the extractedbody text; and use the tags to determine whether the extracted body textcomprises language constructs associated with an intent indicative ofnotification, urgency, action, consequence, and threat.
 7. Theinformation handling system of claim 6 wherein the processor isconfigured to determine whether the extracted body text comprises thelanguage constructs by being further configured to: match the tags withlanguage construct information stored in a second memory coupled to theprocessor in one or more dictionaries, wherein the language constructinformation stored in the dictionaries is pre-determined during trainingof a machine-learning body text classifier on one or more datasetscomprising known phishing emails and known non-phishing emails.
 8. Theinformation handling system of claim 1 wherein the processor isconfigured to classify the email message as one of phishing or notphishing by being further configured to: classify the email message as aphishing message when both the extracted URL phishing determination andthe extracted body text determination indicate the message is phishing;classify the email message as a phishing message when one of theextracted URL phishing determination or the extracted body textdetermination has a probability of being phishing above a set threshold;classify the email message as not a phishing message when neither of theextracted URL phishing determination or the extracted body textdetermination has a probability of being phishing above the setthreshold.
 9. The information handling system of claim 1 wherein the instructions executable by the processor are further configured to store the email message in a third memory; and transmit a quarantine message to a recipient of the email message, wherein the quarantine message comprises a notification that the email message has been quarantined, and information associated with the email message.
 10. The information handling system of claim 1, wherein the determining whether the extracted URL comprises one or more features associated with a phishing URL and the determining whether the extracted body text comprises one or more features associated with a phishing email message are both performed by an associated machine learning classifier.
 11. A method for identifying phishing email messages, the method comprising: receiving, at a network interface coupled to an enterprise network, an electronic mail (email) message; extracting a uniform resource locator (URL) address embedded in the email message; determining whether the extracted URL comprises one or more features associated with a phishing URL; extracting body text from the email message; determining whether the extracted body text comprises one or more features associated with a phishing email message; and classifying the email message as one of phishing or not phishing using the determinations associated with the extracted URL and the extracted body text.
 12. The method of claim 11 wherein the one or more features associated with a phishing URL are pre-determined by training a machine-learning URL classifier on one or more datasets comprising known phishing URLs and known non-phishing URLs.
 13. The method of claim 12, wherein the one or more features associated with a phishing URL comprise one or more of: use of shortening services on the URL; an Internet Protocol (IP) address; a URL of greater than 75 characters; an “@” symbol within the URL; multiple sets of double slashes within the URL; a prefix or suffix separated by a hyphen to a domain of the URL; one or more subdomains; and an “https” within the domain of the URL.
 14. The method of claim 11 wherein the one or more features associated with a phishing email message are pre-determined by training a machine-learning body text classifier on one or more datasets comprising known phishing emails and known non-phishing emails.
 15. The method of claim 14, wherein the one or more features associated with a phishing email message comprise one or more of: a general greeting; a lack of richness in vocabulary; a lack of similarity between a subject of the email and the extracted body text; excessive use of pronoun references; and intent of the body text that is indicative of notification, urgency, action, consequence, and threat.
 16. The method of claim 11 wherein said determining whether the extracted body text comprises one or more features associated with a phishing email message further comprises: performing natural language processing of the extracted body text to generate dependency and part-of-speech tags associated with sentences within the extracted body text; and determining whether the extracted body text comprises language constructs associated with an intent indicative of notification, urgency, action, consequence, and threat, using the tags associated with the sentences.
 17. The method of claim 16 wherein said determining whether the extracted body text comprises the language constructs further comprises: matching the tags with language construct information stored in a second memory coupled to the processor in one or more dictionaries, wherein the language construct information stored in the dictionaries is pre-determined during training of a machine-learning body text classifier on one or more datasets comprising known phishing emails and known non-phishing emails.
 18. The method of claim 11 wherein said classifying the email message as one of phishing or not phishing further comprises: classifying the email message as a phishing message when both the extracted URL phishing determination and the extracted body text determination indicate the message is phishing; classifying the email message as a phishing message when one of the extracted URL phishing determination or the extracted body text determination has a probability of being phishing above a set threshold; and classifying the email message as not a phishing message when neither the extracted URL phishing determination nor the extracted body text determination has a probability of being phishing above the set threshold.
 19. The method of claim 11 further comprising: storing the email message in a third memory; and transmitting a quarantine message to a recipient of the email message, using the network interface, wherein the quarantine message comprises a notification that the email message has been quarantined, and information associated with the email message.
 20. An information handling system configured to examine communications incoming to an enterprise network for phishing communications, and comprising: a processor; a network interface, coupled to the processor, and communicatively coupled to the enterprise network; a first memory storing instructions executable by the processor and configured to extract a uniform resource locator (URL) address embedded in an incoming communication message received by the network interface, determine whether the extracted URL comprises one or more features associated with a phishing URL, extract body text from the incoming communication message, determine whether the extracted body text comprises one or more features associated with a phishing communication, and classify the incoming communication message as one of phishing or not phishing using the determinations associated with the extracted URL and the extracted body text.
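By way of non-limiting illustration only, the URL features recited in claim 13 and the combination rule recited in claims 8 and 18 might be sketched in Python as follows. This is a minimal sketch under stated assumptions and not the claimed implementation: the names `SHORTENERS`, `url_features`, `Determination`, and `classify_message`, the contents of the shortener list, and the default threshold of 0.8 are all hypothetical, and each feature test is a simple heuristic standing in for whatever a trained classifier would learn.

```python
import ipaddress
from dataclasses import dataclass
from urllib.parse import urlsplit

# Hypothetical, deliberately incomplete list of shortening services (assumption).
SHORTENERS = {"bit.ly", "tinyurl.com", "t.co", "goo.gl"}


def url_features(url: str) -> dict:
    """Return boolean flags loosely mirroring the URL features of claim 13."""
    host = (urlsplit(url).hostname or "").lower()
    try:
        ipaddress.ip_address(host)
        is_ip = True
    except ValueError:
        is_ip = False
    return {
        "shortener": host in SHORTENERS,
        "ip_address": is_ip,
        "long_url": len(url) > 75,
        "at_symbol": "@" in url,
        # The scheme separator is one "//"; any further "//" counts as extra.
        "multiple_double_slashes": url.count("//") > 1,
        "hyphen_in_domain": "-" in host,
        # Crude heuristic: more than one dot in a non-IP host implies a subdomain.
        "subdomains": (not is_ip) and host.count(".") > 1,
        "https_in_domain": "https" in host,
    }


@dataclass
class Determination:
    """Output of one classifier: a hard label plus a phishing probability."""
    is_phishing: bool
    probability: float


def classify_message(url_det: Determination,
                     text_det: Determination,
                     threshold: float = 0.8) -> bool:
    """Combine the two determinations; the threshold value is an assumption."""
    # Both determinations indicate phishing.
    if url_det.is_phishing and text_det.is_phishing:
        return True
    # Either determination alone exceeds the set threshold.
    if url_det.probability > threshold or text_det.probability > threshold:
        return True
    # Neither determination exceeds the threshold.
    return False


if __name__ == "__main__":
    print(url_features("http://secure-login.https-bank.example.com//update"))
    print(classify_message(Determination(False, 0.9), Determination(False, 0.1)))
```

In this sketch each determination carries both a label and a probability, so that agreement between the two classifiers and a single high-confidence result can be tested separately, as the claims distinguish those two cases.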